GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008
GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008HuaHuaY wants to merge 6 commits into
Conversation
|
|
| std::map</*column_id=*/int32_t, std::shared_ptr<BloomFilter>>; | ||
| struct RowGroupBloomFilters { | ||
| RowGroupBloomFilters() = default; | ||
| RowGroupBloomFilters(RowGroupBloomFilters&&) noexcept = default; |
There was a problem hiding this comment.
I need these to prevent MSVC from attempting to instantiate the copy constructor. See microsoft/STL#5552 and microsoft/STL#5084.
There was a problem hiding this comment.
Or we just keep using std::shared_ptr instead of std::unique_ptr. Is it important here?
|
@wgtmac @alamb @etseidl @emkornfield Please take a look. |
|
I am not likely to have time to review C++ code in the arrow repository unfortunately |
| std::to_string(bloom_filter_options.fpp)); | ||
| } | ||
| if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) { | ||
| throw ParquetException("Bloom filter number of distinct values must be >= 0, got " + |
There was a problem hiding this comment.
What is the expected behavior of 0?
There was a problem hiding this comment.
It will create a smallest bloom filter.
wgtmac
left a comment
There was a problem hiding this comment.
Generally LGTM. I left some nits.
| const double avg_fill = static_cast<double>(total_set_bits) / | ||
| (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8); |
There was a problem hiding this comment.
More simply
| const double avg_fill = static_cast<double>(total_set_bits) / | |
| (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8); | |
| const double avg_fill = static_cast<double>(total_set_bits) / (num_bytes_ * 8); |
| DCHECK_GT(num_folds, 0); | ||
|
|
||
| const uint32_t num_blocks = NumBlocks(); | ||
| const uint32_t group_size = UINT32_C(1) << num_folds; |
There was a problem hiding this comment.
Can you add comments? It's not obvious what a "group size" is.
| } | ||
| } | ||
|
|
||
| num_bytes_ = new_num_blocks * kBytesPerFilterBlock; |
There was a problem hiding this comment.
data_ is now oversized, would it be useful to shrink it here?
| } | ||
| ++num_folds; | ||
| } | ||
| return num_folds; |
There was a problem hiding this comment.
With this algorithm the actual size reduction will always be a power of 2 (group_size = UINT32_C(1) << num_folds). Why aren't we trying to be more granular?
| std::map</*column_id=*/int32_t, std::shared_ptr<BloomFilter>>; | ||
| struct RowGroupBloomFilters { | ||
| RowGroupBloomFilters() = default; | ||
| RowGroupBloomFilters(RowGroupBloomFilters&&) noexcept = default; |
There was a problem hiding this comment.
Or we just keep using std::shared_ptr instead of std::unique_ptr. Is it important here?
| filter.GetBitsetSize()); | ||
| for (uint64_t hash : hashes) { | ||
| EXPECT_TRUE(filter.FindHash(hash)); | ||
| } |
There was a problem hiding this comment.
Should we check that most non-inserted values are not found, with an actual FPP value below kFpp?
| (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8); | ||
| const auto max_folds = static_cast<uint32_t>(std::countr_zero(num_blocks)); | ||
|
|
||
| if (avg_fill == 0.0) { |
There was a problem hiding this comment.
I little bit forgot would this really happens when writing a parquet file?
| const auto* bitset32 = reinterpret_cast<const uint32_t*>(data_->data()); | ||
| const uint32_t num_words = num_bytes_ / static_cast<uint32_t>(sizeof(uint32_t)); | ||
| for (uint32_t i = 0; i < num_words; ++i) { | ||
| total_set_bits += static_cast<uint64_t>(std::popcount(bitset32[i])); |
There was a problem hiding this comment.
I don't know whether internal::CountSetBits easy to understand here ( though popcount is right and a bit faster)
| BloomFilterBuilder, BloomFilterBuilderFoldingTest, | ||
| ::testing::Values(BloomFilterBuilderFoldingTestCase{.ndv = 1'000'000, | ||
| .fold = true, | ||
| .inserted_count = 1000, |
There was a problem hiding this comment.
Can we add a test for max fold (when inserted count == 0)?
Rationale for this change
This PR follows apache/arrow-rs#9628. It supports optimizing the disk usage of the Bloom filter. So specifying an ndv value larger than the actual value will not affect disk usage.
What changes are included in this PR?
BloomFilterBuilderwill try to fold the bloom filter before writing it to the output stream.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
The type of
ndvinBloomFilterOptionsis changed fromint32_ttostd::optional<int64_t>. And the argument type ofOptimalNumOfBytesandOptimalNumOfBitsinBlockSplitBloomFilteris changed fromuint32_t ndvtouint64_t ndv